For a set of nodes \(V=\{1,\ldots,n\}\) we define ties as unordered pairs of nodes \(\{i,j\}\) with \(i,j\in V\) and \(i\neq j\). We can denote the set of all unordered pairs \({{V}\choose{2}}\), meaning that the ties or edges are two-element subsets of \(V\), and the set of ties \(E \subseteq {{V}\choose{2}}\).
For all \(\{i,j\} \in {{V}\choose{2}}\), we define the tie-variables \(x_{ij}\) as being indicators \[ \begin{equation*} x_{ij}= \left\{ \begin{array}{lr} 1,&\text{if there is an edge between } i \text{ and } j\\ 0,&\text{else} \end{array} \right. \end{equation*} \]
We collect these in an \(n \times n\) adjacency matrix \[ X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n}\\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n}\\ x_{31} & x_{32} & x_{33} & \cdots & x_{3n}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nn}\\ \end{bmatrix} \]
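As a concrete check of these definitions, base R can enumerate the two-element subsets of \(V\) directly (a minimal sketch using combn, anticipating the R code below):

```r
n <- 5
pairs <- t(combn(n, 2))  # all two-element subsets {i, j} of V = {1, ..., n}
head(pairs, 3)           # each row is one unordered pair
nrow(pairs)              # choose(n, 2) = 10 possible ties
```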
This is an introduction to social networks using built-in functions in R and the packages sna and network. We will learn the basics of creating, summarising, and plotting networks.
For full details of the packages, see https://cran.r-project.org/web/packages/sna/sna.pdf and https://cran.r-project.org/web/packages/network/network.pdf. For accessible general R-help see https://www.statmethods.net/ and for any kind of errors use https://google.com.
This introduction is deliberately written in inelegant R, using functions as basic as possible. Many packages offer sleeker and more user-friendly network routines, such as ‘igraph’. In particular, I would like to recommend the packages of David Schoch http://mr.schochastics.net/ for accessible and elegant network analysis in R. In general, basic plots in R (described in https://www.statmethods.net/graphs/index.html) are functional, but more advanced and better-looking plots can be achieved through ‘ggplot’.
For basic concepts in network analysis see G. Robins (2015) and Borgatti, Everett, and Johnson (2018). There is also a handy online book http://faculty.ucr.edu/~hanneman/nettext/ (Hanneman and Riddle 2005).
To use sna (Butts 2016) and network (Butts 2015) for the first time, install the packages (uncomment the install commands)
# install.packages("sna")
# install.packages("network")
Once packages are installed, load them
library("sna")
library("network")
Create an empty adjacency matrix for n = 5 nodes
n <- 5
ADJ <- matrix(0,n,n) # create a matrix with n rows and n columns and all values 0
Add ties \(1 \rightarrow 2\), \(1 \rightarrow 3\), \(2 \rightarrow 3\), \(3 \rightarrow 4\), and \(4 \rightarrow 5\)
ADJ[1,2] <- 1
ADJ[1,3] <- 1
ADJ[2,3] <- 1
ADJ[3,4] <- 1
ADJ[4,5] <- 1
ADJ
To make the network undirected, add the ties \(2 \rightarrow 1\), \(3 \rightarrow 1\), \(3 \rightarrow 2\), \(4 \rightarrow 3\), and \(5 \rightarrow 4\)
ADJ[2,1] <- 1
ADJ[3,1] <- 1
ADJ[3,2] <- 1
ADJ[4,3] <- 1
ADJ[5,4] <- 1
ADJ
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 1 0 0
## [2,] 1 0 1 0 0
## [3,] 1 1 0 1 0
## [4,] 0 0 1 0 1
## [5,] 0 0 0 1 0
In general the cell ADJ[i,j] corresponds to the tie-variable \(X_{i,j}\). Here \(x_{1,2}=1\)
ADJ[1,2]
but, for example, \(x_{1,4}=0\)
ADJ[1,4]
The ties of node \(i=1\) are given by the \(i\)’th row
ADJ[1,]
The adjacency matrix has
dim(ADJ)
rows and columns. This means that there are \(n \times n\) cells in the adjacency matrix.
dim(ADJ)[1]*dim(ADJ)[2]
length(ADJ)
The \(n\) diagonal elements \(x_{11},x_{22},\ldots,x_{nn}\) are zero by definition, which means that there are \(n \times n - n = n(n-1)\) variables that can be non-zero, here
dim(ADJ)[1]*dim(ADJ)[2] - n
Density: How many variables are equal to 1 out of the total possible?
The total number of ones \[L = \sum_{i,j,i\neq j}x_{ij}=x_{12}+\cdots+x_{1n}+x_{21}+\cdots+x_{(n-1)n}\] is simply a count of the number of non-zero entries
sum(ADJ)
The density thus is
sum(ADJ)/(n*(n-1))
and 50% of possible ties are present in the network.
How many ties does a node have?
The degree \(d_i\) of a node \(i\) is defined as the sum \(d_i=\sum_{j}x_{i,j}=x_{i,1}+x_{i,2}+\cdots + x_{i,n}\) (the diagonal term \(x_{i,i}\) is zero by definition). The degree of node \(i=1\) is thus
sum(ADJ[1,])
and the degree of node \(i=2\) is
sum(ADJ[2,])
Calculate the column sum of the adjacency matrix to get the vector of degrees (note the capital S)
colSums(ADJ)
The degree distribution is the table of frequencies of degrees
table( colSums(ADJ) )
You can chart the degree distribution with a bar chart
plot( table( colSums(ADJ) ))
You can use standard R-routines to explore the adjacency matrix
For example, finding which node(s) have, say, degree 3
which(colSums(ADJ)==3)
Or subsetting the adjacency matrix to look only at nodes with degree 2 or greater
use <- which(colSums(ADJ)>=2) # indices of the nodes with degree 2 or greater
ADJ[use,use]
Most network metrics can be calculated using linear algebra. For example, if \(X_{i,j}\) in \(X\) tells you whether \(i\) and \(j\) are directly connected, element \((XX)_{i,j}\) of the matrix product \(XX\) tells you how many paths \(i \rightarrow k \rightarrow j\) there are
ADJ %*% ADJ
Element \((XXX)_{i,j}\) of the matrix product \(XXX\), tells you how many paths \(i \rightarrow k \rightarrow h \rightarrow j\) there are
ADJ %*% ADJ %*% ADJ
Plotting the matrix object ADJ is not meaningful because R does not know that this is an adjacency matrix. To interpret ADJ as a network, translate the adjacency matrix to a network object
net <- as.network(ADJ, directed = FALSE)
NB: in the network package you use directed=FALSE in lieu of setting mode equal to graph.
The new object net is an object of type
class(net)
While printing ADJ to screen just gives you the matrix, printing net gives you a summary of the network
net
When plotting a network object, R knows that you want to plot the sociogram
plot( net )
For various plotting option see ?plot.network. For example, set node-size to degree, include labels, and set different colours
plot( net , # the network object
vertex.cex = degree(net) , # how should nodes (vertices) be scaled
displaylabels = TRUE, # display the labels of vertices
vertex.col = c('red','blue','grey','green','yellow'))
Note that degree(net) is a built-in function in sna for calculating the degrees of the nodes. The next step will explore more of these functions.
Continue with the network that we constructed previously in Minimal Intro.
Recall, basic properties of this network are
net
## Network attributes:
## vertices = 5
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 5
## missing edges= 0
## non-missing edges= 5
##
## Vertex attribute names:
## vertex.names
##
## No edge attributes
Which nodes have more ties?
degree( net )
This is a vector with the degree centrality for each node in the network. Degree centrality is also called Freeman degree centrality (Freeman 1978).
Betweenness is related to connectivity and flow in a network (Freeman 1978; Borgatti and Everett 2006). This measure requires that you know the structure of the entire network.
betweenness( net , gmode ='graph' )
Node 3 is on the shortest path between 4 pairs of nodes:
| Step 1 | Step 2 | Step 3 | Step 4 |
|---|---|---|---|
| 1 | 3 | 4 | |
| 1 | 3 | 4 | 5 |
| 2 | 3 | 4 | |
| 2 | 3 | 4 | 5 |
These nodes will only be able to ‘communicate’ via 3
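This count can be verified by brute force in base R: node 3 is intermediate on a shortest path between \(i\) and \(j\) exactly when \(d(i,3)+d(3,j)=d(i,j)\). A sketch, rebuilding the small example network and computing geodesic distances from matrix powers:

```r
# Rebuild the small undirected example network from the Minimal Intro
n <- 5
ADJ <- matrix(0, n, n)
edges <- rbind(c(1,2), c(1,3), c(2,3), c(3,4), c(4,5))
for (e in seq_len(nrow(edges))) {
  ADJ[edges[e, 1], edges[e, 2]] <- 1
  ADJ[edges[e, 2], edges[e, 1]] <- 1
}
# Geodesic distances via matrix powers: d(i,j) is the smallest k with (ADJ^k)[i,j] > 0
D <- matrix(Inf, n, n); diag(D) <- 0
P <- diag(n)
for (k in 1:(n - 1)) {
  P <- P %*% ADJ
  D[D == Inf & P > 0] <- k
}
# Node 3 is intermediate on a shortest i-j path when d(i,3) + d(3,j) == d(i,j)
mid <- outer(D[, 3], D[3, ], "+") == D & D > 1
mid[3, ] <- FALSE; mid[, 3] <- FALSE   # exclude pairs where 3 is an endpoint
which(mid & upper.tri(mid), arr.ind = TRUE)  # the 4 pairs: (1,4), (2,4), (1,5), (2,5)
```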
A cutpoint is a node that, when removed, results in the network having more components. In simpler terms, a cutpoint is a node that connects otherwise disconnected nodes
cutpoints( net , mode = "graph")
## [1] 3 4
The extent to which nodes broker between other nodes has been the focus of a large part of Ron Burt’s career. He proposes some elaborations for measuring brokerage in Burt (2009), called constraint and effective size. These are local measures that, as opposed to cutpoints and betweenness, do not require that you know the network structure of the entire network.
The triad census tabulates the different types of triads or triples (J. A. Davis 1967; Holland and Leinhardt 1972)
| 0 edges | 1 edge | 2 edges | 3 edges |
|---|---|---|---|
| \(\sharp\) triples | \(\sharp\) triples | \(\sharp\) triples | \(\sharp\) triples |
triad.census( net , mode = "graph")
## 0 1 2 3
## [1,] 0 6 3 1
There are 6 triples with one tie. For example the subgraph consisting of nodes 1, 4, and 5.
plot(as.network( ADJ[c(1,4,5),c(1,4,5)] , directed = FALSE),
vertex.cex = degree(net)[c(1,4,5)],
vertex.col = c('red','blue','grey','green','yellow')[c(1,4,5)])
Open triad: There are 3 triples with two ties. For example the subgraph consisting of nodes 3, 4, and 5.
plot(as.network( ADJ[c(3,4,5),c(3,4,5)] , directed = FALSE),
vertex.cex = degree(net)[c(3,4,5)],
vertex.col = c('red','blue','grey','green','yellow')[c(3,4,5)])
Closed triad: There is 1 triple with three ties. For example the subgraph consisting of nodes 1, 2, and 3.
plot(as.network( ADJ[c(1,2,3),c(1,2,3)] , directed = FALSE ),
vertex.cex = degree(net)[c(1,2,3)],
vertex.col = c('red','blue','grey','green','yellow')[c(1,2,3)] )
A clique is a maximal complete subgraph. This means that a \(k\)-clique is a subgraph with \(k\) nodes that are all connected to each other and that is not contained in any larger complete subgraph. In net, nodes 4 and 5 constitute a 2-clique and nodes 1, 2, and 3 constitute a 3-clique. Like the triad census, we can calculate a clique census for a graph
cc <- clique.census( net , mode = "graph")
The object cc is of class
class(cc)
The list has the following objects
names(cc)
The object clique.count tabulates, for each clique size, the number of cliques that each node belongs to
cc$clique.count
A list of lists of cliques and their members is provided in
cc$cliques
Note that subgraphs of cliques are not listed. For example, the 2-clique with nodes 1 and 2 is not listed, as both are part of the larger 3-clique.
This is an introduction to analysing directed networks in R using the packages sna and network. We will revisit the adjacency matrix and the notion of tie variables for directed networks, and learn the basics of analysing them.
We will use a dataset of 73 high school pupils collected by James Coleman (1964) that comes with the package sna.
For full details of the packages, see https://cran.r-project.org/web/packages/sna/sna.pdf and https://cran.r-project.org/web/packages/network/network.pdf.
The background and description of the dataset are provided in the help file
?coleman
Now load the network
data(coleman)
What do I have in my workspace and what did I load?
The function data() loaded something. Use the general purpose command ls() to list what is in your workspace
ls()
This is not one adjacency matrix but
class(coleman)
As described in the help file, this is an array of 2 adjacency matrices, each of dimension \(73 \times 73\)
dim(coleman)
The individual slices are matrices
class(coleman[1,,])
class(coleman[2,,])
You can print coleman[1,,] and coleman[2,,] to screen to view the adjacency matrices (but that will just be a lot of ones and zeros, in particular \(2 \times 73 \times 73\) ones and zeros). You can visualise the adjacency matrices in these matrix plots
par( mfrow = c(1,2))
plot.sociomatrix( coleman[1,,] , drawlab=FALSE , drawlines = FALSE, xlab = 'FALL')
plot.sociomatrix( coleman[2,,] , drawlab=FALSE , drawlines = FALSE, xlab = 'SPRING')
In the adjacency matrix, rows record the ties sent, and columns record the ties received. Student 1 has the out-tie variables \(x_{1,2},x_{1,3},\ldots,x_{1,n}\)
coleman[1,1,]
For example, that coleman[1,1,14] is 1 means that student 1 has a tie to student 14, and that coleman[1,1,2] is 0 means that student 1 does not have a tie to student 2.
Taking the sum of the out-ties
sum(coleman[1,1,])
gives us student 1’s outdegree, \(d_i^o= \sum_{j,j\neq i}x_{i,j}\).
To get the outdegree of all pupils, sum across columns for all rows
rowSums(coleman[1,,])
And, like for un-directed networks, we can chart the outdegree distribution
plot( table (rowSums(coleman[1,,]) ))
The ties a student has received are given by the column of that particular student. Student 1 has received the tie variables \(x_{2,1},x_{3,1},\ldots,x_{n,1}\)
coleman[1,,1]
For example, that coleman[1,14,1] is 0 means that student 1 has not received a tie from student 14.
To get the indegree of all pupils sum across rows for all columns
colSums(coleman[1,,])
And, like for un-directed networks, we can chart the indegree distribution
plot( table (colSums(coleman[1,,]) ))
The number of ties sent and received are not the same!
For example, student 1 has nominated 5 other people but student 1 has not been nominated back at all. Plotting the outdegrees against the indegrees
plot( jitter(rowSums(coleman[1,,]) ) , jitter( colSums(coleman[1,,]) ) )
reveals that some send more than they receive, and some send fewer than they receive.
One reason for this is that some pairs are asymmetric and other pairs are symmetric in their choices (there are of course other possible explanations; Igarashi and Koskinen 2020)
A dyad is a pair of nodes and their ties \((X_{i,j},X_{j,i})\) to each other. To investigate whether pairs are symmetric or asymmetric, consider e.g. that 1 nominated 14 but 14 did not nominate 1 back: this pair is asymmetric.
coleman[1,1,14]
coleman[1,14,1]
The dyad 19 and 4
coleman[1,4,19]
coleman[1,19,4]
is symmetric, or reciprocal - we call this a mutual dyad.
In general we can distinguish between dyads that are Mutual, Asymmetric, and Null (Holland and Leinhardt 1972):
par ( mfrow = c(1,3))
plot( as.network(coleman[1,c(4,19),c(4,19)] ),main='Mutual' ,label = c(4,19), usecurve = TRUE, edge.curv = 0.01, arrowhead.cex = 2.5, vertex.cex = degree(coleman[1,,])[c(4,19)] )
plot( as.network(coleman[1,c(1,14),c(1,14)] ),main='Asymmetric' ,label = c(1,14), arrowhead.cex = 2.5, vertex.cex = degree(coleman[1,,])[c(1,14)] )
plot( as.network(coleman[1,c(1,2),c(1,2)] ),main='Null' ,label = c(1,2) , vertex.cex = degree(coleman[1,,])[c(1,2)] )
Recall that the product of a matrix \(X\) with itself has entries \((XX)_{i,j}=\sum_{k}X_{i,k} X_{ k,j }\). Consequently, \(\mathrm{tr}(XX)\) gives you twice the number of reciprocated dyads. There is no standard command for the trace in R, but you can take the sum() of the diagonal diag() of the matrix product, e.g. for adjacency matrix A, sum( diag( A %*% A ) ). The elementwise product in R, *, is the Hadamard product \(X \odot X^{\top}\), where \((X \odot X^{\top})_{i,j}=X_{i,j} X_{ j,i }\). Consequently, for adjacency matrix A, A * t(A) is the matrix of reciprocated ties.
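A minimal base-R illustration with a toy 3-node directed matrix (hypothetical data, not the Coleman network):

```r
# Toy directed adjacency matrix: 1 <-> 2 is mutual, 2 -> 3 is asymmetric
A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 0, 0), 3, 3, byrow = TRUE)
sum(diag(A %*% A)) / 2   # tr(AA)/2 = number of mutual dyads: 1
A * t(A)                 # elementwise product: the matrix of reciprocated ties
```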
The dyad census tabulates the number of dyads that are Mutual, Asymmetric, and Null
dyad.census(coleman[1,,])
## Mut Asym Null
## [1,] 62 119 2447
Note that in the Coleman example, the total number of M, A, and N dyads sum up to
62 + 119 + 2447
which is exactly equal to \(n(n-1)/2=73\times72/2=2628\) - this is the total number of (un-ordered) pairs, the number of cells in the upper triangle of the adjacency matrix.
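The arithmetic can be checked directly with the base-R function choose():

```r
n <- 73
n * (n - 1) / 2   # 2628 un-ordered pairs
choose(n, 2)      # the same count via the binomial coefficient
```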
For triples in directed networks, there are many different kinds. For open triads, we could for example consider the three different types you can achieve with 0 Mutual dyads, 2 Asymmetric dyads, and 1 Null dyad.
par( mfrow = c(1,3))
plot( as.network( matrix(c(0,1,0,0,0,1,0,0,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='021C')
plot( as.network( matrix(c(0,1,0,0,0,0,0,1,0),3,3 ) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='021D')
plot( as.network( matrix(c(0,0,0,1,0,1,0,0,0),3,3 ) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='021U')
The triples are labeled by the number of Mutual, Asymmetric, and Null dyads - the so-called MAN labeling scheme. Here, the ‘C’, ‘D’, and ‘U’ distinguish between ‘Cyclic’, ‘Down’, and ‘Up’.
For closed triads consider the Transitive (‘T’), Cyclic (‘C’), and complete closed triads
par( mfrow = c(1,3))
plot( as.network( matrix(c(0,1,0,0,0,1,1,0,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='030C')
plot( as.network( matrix(c(0,1,1,0,0,0,0,1,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='030T')
plot( as.network( matrix(c(0,1,1,1,0,1,1,1,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='300', usecurve = TRUE, edge.curv = 0.01)
When considering the potential interpretation of these closed triads, imagine scenarios where ties reflect ‘liking’, ‘look up to’, ‘give orders’, ‘passing on information’, and how these structures would be reflected in status, chain of command, access to information, etc.
In total there are 16 types of triads, all of which are labelled using the MAN labeling scheme (Holland and Leinhardt 1972). For the Coleman data, we calculate the triad census as
triad.census(coleman[1,,])
## 003 012 102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210
## [1,] 50171 7384 3957 64 121 128 139 70 23 1 20 43 10 9 34
## 300
## [1,] 22
Note that in the Coleman example, the total number of triads is
sum(triad.census(coleman[1,,]))
which is exactly equal to \[\binom{n}{3}= \frac{n(n-1)(n-2)}{3\times 2}=73\times72\times71/6=62196\] - this is the total number of (un-ordered) triples, the number of ways in which you can choose 3-element subsets out of a 73-element set.
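Again, this can be checked directly in base R:

```r
n <- 73
n * (n - 1) * (n - 2) / 6   # 62196 un-ordered triples
choose(n, 3)                # the same count via the binomial coefficient
```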
Plotting the sociogram of the network does not differ from the undirected case, with the exception that the ties are now represented by arrows.
plot( as.network( coleman[1,,] , directed= TRUE), # the network object
vertex.cex = degree(coleman[1,,], cmode = 'indegree')/5 + .2 )
The size of a node here is set proportional to the indegree centrality.
There are two large ‘clusters’ of nodes that are connected within but not between - these are two components
Using the data we downloaded and formatted in the Data-Formatting.Rmd vignette, we will here investigate how much these (undirected) networks differ from random networks. Analogously, we might ask what features of the networks are not guided by chance.
Make sure that you have loaded the required libraries, and also load ergm
library("ergm")
load(url("https://raw.githubusercontent.com/johankoskinen/CHDH-SNA/main/data/undirected.RData"))
ls()#list all the objects you downloaded
## [1] "ADJ" "advice.adj" "attributes"
## [4] "cc" "cd" "CoKaMe"
## [7] "CoKaMeAdvice" "CoKaMeAttributes" "CoKaMeDisc"
## [10] "CoKaMeFriend" "coleman" "coord"
## [13] "deg.dist" "degree.dist" "DistanceHospital"
## [16] "EIES1" "EIES1.net" "EIES2"
## [19] "EIES2.net" "EIESATT" "g"
## [22] "gd" "greekBomb" "greekBomb.net"
## [25] "kapf" "KAPFTS1" "KAPFTS1.net"
## [28] "KAPFTS2" "KAPFTS2.net" "lgc"
## [31] "like3bin.net" "Liking1" "Liking2"
## [34] "Liking3" "Liking3Bin" "max.deg"
## [37] "n" "N" "net"
## [40] "net.ave.dist" "net.centralisation" "net.closed.triad"
## [43] "net.components" "net.dens" "net.large.comp"
## [46] "net.open.triad" "net.profiles" "net.size"
## [49] "net.ties" "netnames" "Noordin"
## [52] "Noordin.net" "padgbus.net" "padgetbusines"
## [55] "padgettbus" "padgettmar" "padgmar.net"
## [58] "RDCON" "RDCON.net" "RDGAM"
## [61] "RDGAM.net" "RDNEG" "RDNEG.net"
## [64] "RDPOS" "RDPOS.net" "row.index"
## [67] "sage.net" "SagemanAdjacencyMatrix" "SagemanAtt"
## [70] "sampson" "SizeHospital" "temp"
## [73] "TransBin" "TransBin.net" "tribes.all"
## [76] "tribesNeg" "tribesNeg.net" "tribesPos"
## [79] "tribesPos.net" "TypeHospital" "use"
## [82] "valuedTransfer" "vertex.sides" "Webster"
## [85] "Webster.net" "WebsterAtt" "wiring"
## [88] "Wiring" "WTN" "WTN.net"
## [91] "WTNmatrix" "Zachary" "ZacharyBinary"
## [94] "ZacharyBinary.net" "ZacharyValued"
The object net.profiles contains a number of network summary measures for the formatted networks
names(net.profiles)
## [1] "netnames" "net.size" "net.ties"
## [4] "net.dens" "net.open.triad" "net.closed.triad"
## [7] "net.centralisation" "net.components" "net.large.comp"
## [10] "net.ave.dist"
We let the set of \(n\) nodes be \(V = \{1,\ldots,n \}\), and we define the symmetric adjacency matrix \(X=(X_{ij})_{\{i,j\} \in \mathcal{N}}\), where the non-redundant pairs are \(\mathcal{N} = { V \choose 2}\), with elements
\[ X_{ij}= \left\{ \begin{array}{lr} 1,&\text{if there is a tie between } i \text{ and } j, \ i,j \in V\\ 0,&\text{else} \end{array} \right. \]
Assume that, independently for each pair \(\{i,j\}\), we decide whether the tie exists by tossing a fair coin. This means that, independently for all \(\{i,j\} \in \mathcal{N}\), \[ \Pr (X_{ij}=1) = 0.5 \] In sna, the function rgraph generates random graphs.
Generate a \(n=16\) completely random graph and compare it to the Padgett and Ansell (1993) Florentine families business network
We will generate a network that only has \(n\) in common with the Padgett data set.
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
n <- net.profiles$net.size[ row.index ]
g <- rgraph(n,#size of the network
m=1,# number of networks to generate
tprob=0.5,# tie probability
mode="graph") # undirected (graph) or directed (digraph)
g.net <- as.network(g,directed=FALSE)# 'sna' gives you a matrix
gden( padgbus.net )# observed density
gden(g.net)# density
Now we can compare what a network would look like if ties were formed completely at random with our observed network.
par(mfrow=c(1,2))
plot(padgbus.net)
plot(g.net)
Does it look more ‘random’ than the Padgett network?
This is only one random network, maybe it looks different by chance. Generate \(m = 100\) networks
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
n <- net.profiles$net.size[ row.index ]
g100 <- rgraph(n,#size of the network
m=100,# number of networks to generate
tprob=0.5,# tie probability
mode="graph") # undirected (graph) or directed (digraph)
Any metric that we calculate for the observed network we can calculate for a random network
par( mfrow = c(1,3) )
hist( gden( g100) , # the density for each of the 100 random networks
xlim=range( gden( g100 ),
gden( padgbus.net ) ) , main='density')
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
degrees <-degree(g100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates
matplot(c(0:(max.deg)), t(deg.sist) ,
type ='l',
main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)
triads <- triad.census( g100 , mode='graph')[,4]# the count of closed triads
hist(triads ,
xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference
Clearly, since the random networks are too dense, the degree distributions are also different from the observed network, with nodes on average having a higher degree. And because the networks are so dense, there are too many triangles - but is that simply because we have so many more ties?
Will a random network with density \(0.5\) resemble any of our observed networks?
Plot the densities for the observed networks as a function of size
plot( net.profiles$net.size[order(net.profiles$net.size)],
net.profiles$net.dens[order(net.profiles$net.size)],
type='b',
xlab = 'network size',ylab = 'density',ylim=range(net.profiles$net.dens,.55))
abline(h=0.5, col='red')# reference line for the completely random graphs
Fixing the density at 0.5 irrespective of network size gives unrealistically dense networks.
If we do not get the density right, we won’t get anything right. We introduce a random graph model that fixes the density.
Let \(L(x) = \sum_{i<j}x_{ij}\) be the number of edges in a graph. The number of graphs with exactly \(k\) edges \[ \mathcal{X}_k= \{ x \in \mathcal{X} : L(x)=k \} \] is given by \({ M \choose k}\), where \(M= { n \choose 2}\), and consequently the conditional distribution \[ \Pr(X=x \mid L(x)= k ) = \frac{1}{{ M \choose k}} \]
Note that we are no longer deciding independently for each pair whether there is a tie or not; rather, we select \(k\) pairs at random from the total set of pairs and decide that they will have a tie.
We refer to this model as \(X \thicksim U \mid L(X)= k\).
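To get a feel for the sizes involved, here is a sketch in base R using illustrative values (n = 16 and k = 15 are assumptions matching the Padgett business network used below):

```r
n <- 16
M <- choose(n, 2)   # M = 120 possible edges
k <- 15             # condition on exactly k edges (illustrative value)
choose(M, k)        # number of distinct graphs with L(x) = k
1 / choose(M, k)    # Pr(X = x | L(X) = k), the same for every such graph
```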
In sna, the function rgnm generates random graphs with exactly the same number of ties as the observed network (same density)
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
cg100 <- rgnm(n =100,# number of networks to generate
nv=net.profiles$net.size[ row.index ],#size of the network
m = net.profiles$net.ties[ row.index ],# number of edges
mode="graph") # undirected (graph) or directed (digraph)
Plot the observed network and one of the simulated networks
par(mfrow=c(1,2))
plot(padgbus.net)
plot(as.network(cg100[1,,],directed=FALSE ))
Does the random graph look more like the empirical graph?
Any metric that we calculate for the observed network we can calculate for a random network. Calculate density, degree distribution, and triangles for all 100 networks and plot and compare
par( mfrow = c(1,3) )
hist( gden( cg100) , # the density for each of the 100 random networks
xlim=range( gden( cg100 ),
gden( padgbus.net ) ) , main='density')
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
degrees <-degree(cg100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates
matplot(c(0:(max.deg)), t(deg.sist) ,
type ='l',
main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)
triads <- triad.census( cg100 , mode='graph' )[,4]
hist(triads ,
xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference
The density is constant, but the degree distributions have a completely different shape to the observed (grey), and few of the simulated networks have many triangles
netnames
## [1] "PadgetB" "PadgetM" "TribesPos" "TribesNeg" "WireGam" "WireCon"
## [7] "WirePos" "WireNeg" "Karate" "Sageman" "Greek" "Webster"
## [13] "Noordin" "KapfS1" "KapfS2" "EIES1" "EIES2"
# Pick one
net <- tribesPos.net# to simply rename temporarily
row.index <- which(net.profiles$netnames=='TribesPos')# the row in the matrix containing TribesPos summaries
cg100 <- rgnm(n =100,# number of networks to generate
nv=net.profiles$net.size[ row.index ],#size of the network
m = net.profiles$net.ties[ row.index ],# number of edges
mode="graph") # undirected (graph) or directed (digraph)
Plot the observed network and one of the simulated networks
par(mfrow=c(1,2))
plot(net)
plot(as.network(cg100[1,,],directed=FALSE ))
Calculate density, degree distribution, and triangles for all 100 networks and plot and compare
par( mfrow = c(1,3) )
hist( gden( cg100) , # the density for each of the 100 random networks
xlim=range( gden( cg100 ),
gden( net ) ) , main='density')
abline( v=gden( net ),col='red' )# the observed density for reference
degrees <-degree(cg100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates
matplot(c(0:(max.deg)), t(deg.sist) ,
type ='l',
main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)
triads <- triad.census( cg100 , mode='graph' )[,4]
hist(triads ,
xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference
The conditional uniform \(X \thicksim U \mid L(X)= k\) distribution does not seem to do a good job of capturing clustering. Let us check for all of our networks prepared in Data-Formatting.Rmd:
par(mfrow=c(1,3))
BigT <- apply(net.profiles[,c(2,3)],1,function(x) {
  triad.census( rgnm(n = 30,# number of networks to generate
                     nv = x[1],# size of the network
                     m = x[2],# number of edges
                     mode = "graph"),# undirected (graph) or directed (digraph)
                mode = "graph")[,4]# important: 'graph' gives the 4 undirected triad types; column 4 is closed triads
})
boxplot(BigT[,order(net.profiles$net.size)],#the raw triad counts ordered by network size
ylim=range(BigT,net.profiles$net.closed.triad),
main='triangles',xlab='network',ylab='raw counts')
lines(net.profiles$net.closed.triad[ order(net.profiles$net.size) ],type='b',col='red')
ave.T <- colMeans(BigT)
BigT.norm <- apply(BigT,2, function(x) x/mean(x))# normalize
boxplot(BigT.norm[,order(net.profiles$net.size)],# the normalised triad counts ordered by network size
ylim=range(BigT.norm,net.profiles$net.closed.triad/ave.T ),
main='triangles',xlab='network',ylab='normalised counts')
lines(net.profiles$net.closed.triad[order(net.profiles$net.size)]/ave.T[order(net.profiles$net.size)] ,type='b',col='red')
BigC <- apply( net.profiles[,c(2,3)],1,function(x) {
  centralization( rgnm(n = 30,# number of networks to generate
                       nv = x[1],# size of the network
                       m = x[2],# number of edges
                       mode = "graph"),# undirected (graph) or directed (digraph)
                  degree, mode = "graph")# degree centralization
})
boxplot(BigC[,order(net.profiles$net.size)],# the centrality scores ordered by network size
ylim=range(BigC,net.profiles$net.centralisation),ylab='frequency',xlab='network',main='Centralization')
lines(net.profiles$net.centralisation[order(net.profiles$net.size)],# add observed value for reference
type='b',col='red')
Clearly, density is not the only thing that matters in tie-formation, because none of the simulated networks produce enough triangles
The Bernoulli graph assumes that independently for each dyad \(\{i,j\} \in \mathcal{N}\), a tie is formed with probability \(p_{ij} \in (0,1)\) \[ \Pr(X_{ij}=1 ) = p_{ij} \] It is a homogeneous Bernoulli graph, or Erdős–Rényi graph, if \(p_{ij}=p\) for all \(\{i,j\} \in \mathcal{N}\). We say that \[ X \thicksim Bern(p) \] for a homogeneous Bernoulli model with tie-probability \(p\).
A particular form of Bernoulli graphs are stochastic blockmodels, where the tie-probability for \(\{ i,j \}\) depends on the class memberships of the nodes \(i\) and \(j\).
The model \(X \thicksim U \mid L(X)= k\) did not produce enough triangles and gave the wrong shape for the degree distribution. Assuming \(X \thicksim Bern(p)\), the average density of graphs will be equal to that of the observed network \(x_{\mathrm{obs}}\) if \[ p= \frac{L(x_{\mathrm{obs}})}{M} \] but the variance in \(L(X)\) will not be zero, as it is in \(X \thicksim U \mid L(X)= k\).
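A small numerical sketch (the values n = 16 and \(L(x_{\mathrm{obs}}) = 15\) are assumptions for illustration, consistent with the observed density of 0.125 reported below):

```r
n <- 16
M <- choose(n, 2)  # 120 dyads
L.obs <- 15        # assumed observed number of edges, for illustration
p <- L.obs / M
p                  # 0.125, the observed density
M * p              # expected number of edges under Bern(p): exactly L.obs
```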
Will introducing more variability produce more triangles and more skewed degree distributions?
row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
ber100 <- rgraph(net.profiles$net.size[ row.index ],#size of the network
m=100,# number of networks to generate
tprob=net.profiles$net.dens[ row.index ],# tie probability equal to observed density
mode="graph") # undirected (graph) or directed (digraph)
Plot the observed network and one of the simulated networks
net <- padgbus.net# to simply rename temporarily
par(mfrow=c(1,2))
plot(net)
plot(as.network(ber100[1,,],directed=FALSE ))
gden(net)# observed density
## [1] 0.125
gden(ber100[1,,])# density of a simulated network
## [1] 0.075
Does the random graph look more like the empirical graph?
Any metric that we calculate for the observed network we can calculate for a random network. Calculate density, degree distribution, and triangles for all 100 networks and plot and compare
par( mfrow = c(1,3) )
hist( gden( ber100 ) , # the density for each of the 100 random networks
xlim=range( gden( ber100 ),
gden( net ) ) , main='density')
abline( v=gden( net ),col='red' )# the observed density for reference
degrees <- degree(ber100, g=c(1:100), cmode='indegree')# the degree distributions for all 100 graphs
deg.dist.sim <- cbind( matrix( colSums(degrees==0),100,1),
t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# tabulate() counts from 1, so prepend the number of isolates (degree 0)
matplot(c(0:(max.deg)), t(deg.dist.sim) ,
type ='l',
main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)
triads <- triad.census( ber100 , mode='graph' )[,4]
hist(triads ,
xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference
The degree distributions are a better fit (though all still have the wrong shape), but there are still too few triangles.
To get an idea of just how unlikely it is that a Bernoulli graph could have produced \(5\) triangles (as in Padgett’s data) or more, we can compute the proportion of simulated networks that did:
mean( triads >= 5 )
## [1] 0.02
distances <- apply(ber100,
1,
function(x) tabulate( geodist( x )$gdist[upper.tri(geodist( x )$gdist)] , nbins=10 ) )
matplot(c(1:(10)), distances ,
type ='l',
main='geodesic distribution' ,
xlab='distance',
ylab='frequency')
lines(c(1:(10)),
tabulate( geodist( net )$gdist[upper.tri(geodist( net )$gdist)] , nbins=10 ),
pch=24,bg="blue",lty=1,type='b',
col='red',lwd=3 )
The model \(X \thicksim Bern(p)\) does great in capturing the distances - Bernoulli graphs have short path lengths. It achieves this by sacrificing clustering, though. If there is a choice between ‘linking to someone new’ or ‘linking back to someone you indirectly know’, it chooses the new one - because ties form independently, the Bernoulli tie-probability does not know whom a node already has ties to.
Bernoulli graphs do not waste ties on clustering - this creates reach
Interpreting this in terms of tie-formation processes:
Independence means that a tie is more likely to a ‘new’ person than to ‘a friend of a friend’ because the Bernoulli model does not know (independence) who your friends already are
Clearly, we cannot assume that people form ties independently of each other!
Let us see how \(X \thicksim Bern(p)\) does for all networks.
par(mfrow=c(1,3))
BigT <- apply(net.profiles[,c(2,4)],1,function(x) {
triad.census( rgraph(n = x[1],
m =30,# number of networks to generate
tprob= x[2], # density of network
mode="graph") # undirected (graph) or directed (digraph)
, mode="graph")[,4]# important 'graph' giving the 4 triads
})
boxplot(BigT[,order(net.profiles$net.size)],#the raw triad counts ordered by network size
ylim=range(BigT,net.profiles$net.closed.triad),
main='triangles',xlab='network',ylab='raw counts')
lines(net.profiles$net.closed.triad[ order(net.profiles$net.size) ],type='b',col='red')
ave.T <- colMeans(BigT)
BigT.norm <- apply(BigT,2, function(x) x/mean(x))# normalize
boxplot(BigT.norm[,order(net.profiles$net.size)],# the normalised triad counts ordered by network size
ylim=range(BigT.norm,net.profiles$net.closed.triad/ave.T ),
main='triangles',xlab='network',ylab='normalised counts')
lines(net.profiles$net.closed.triad[order(net.profiles$net.size)]/ave.T[order(net.profiles$net.size)] ,type='b',col='red')
BigC <- apply(net.profiles[,c(2,4)],1,function(x) {
centralization( rgraph(n = x[1],
m =30,# number of networks to generate
tprob= x[2], # density of network
mode="graph") # undirected (graph) or directed (digraph)
, degree, mode="graph")
})
boxplot(BigC[,order(net.profiles$net.size)],# the centrality scores ordered by network size
ylim=range(BigC,net.profiles$net.centralisation),ylab='frequency',xlab='network',main='Centralization')
lines(net.profiles$net.centralisation[order(net.profiles$net.size)],# add observed value for reference
type='b',col='red')
The extra variability offered by \(Bern(p)\) does a little to improve the features of the simulated networks but much of the clustering and centralisation remains unexplained.
It is clear that \(Bern(p)\) does not capture the degree distribution of the typical observed network. This is not surprising given that the tie-probability is the same for all dyads, and thereby for all nodes - all nodes have the same expected degree.
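To see why, note that under \(Bern(p)\) each node’s degree is a \(Binomial(n-1,p)\) variable, so every node has expected degree \((n-1)p\). A quick base-R check (the values of \(n\) and \(p\) here are chosen for illustration only):

```r
# Under Bern(p) each node's degree is Binomial(n - 1, p), so all nodes
# share the same expected degree (n - 1) * p. Simulation check:
set.seed(42)
n <- 16; p <- 0.125
sim.degree <- function() {
  x <- matrix(0, n, n)
  x[upper.tri(x)] <- rbinom(n * (n - 1) / 2, 1, p)  # independent dyads
  x <- x + t(x)                                     # symmetrise
  rowSums(x)                                        # node degrees
}
degs <- replicate(2000, sim.degree())
rowMeans(degs)  # every node's mean degree is close to (n - 1) * p = 1.875
```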
To see how much of the structure of a network we can explain through the degree distribution we introduce yet another model.
Like the \(U \mid L(X)=k\) distribution, we define a distribution for graphs (not for tie-variables). We let \(X \thicksim U \mid X_{\cdot+}=d\) mean that the distribution is uniform on all the graphs that have the exact same degree distribution. Thus \[ \Pr( X = x) = \left\{ \begin{array}{lr} c^{-1},&\text{if } x_{\cdot+}=d\\ 0,&\text{else} \end{array} \right. {\text{,}} \] where \(x_{\cdot+}=(\sum_j x_{1j},\ldots,\sum_j x_{nj})^{\top}\) is the vector of degrees, and \(c = \sharp \{ x \in \mathcal{X}: x_{\cdot+}=d \}\).
There are several algorithms for simulating networks with a fixed degree distribution, and all of the reliable ones rely on Markov chain Monte Carlo (MCMC).
The premise of MCMC is that you want to draw a variable \(x\) from a distribution \(p(x)\) but cannot do so directly. What you can do, though, is say how much more likely one realisation of a variable is relative to another. For example, let \(x\) and \(y\) be two networks and let the target distribution be \(p(x)\). What we need to know in order to use MCMC is the ratio \(p(x)/p(y)\) for all \(x,y \in \mathcal{X}\). This allows us to generate a sequence of realisations \[ x_0,x_1,x_2,\ldots,x_K \] of outcomes. This is an iterative scheme where you have an updating rule that takes an outcome \(x_t\) and updates it to \(x_{t+1}\), and this update relies on our ability to calculate \(p(x)/p(y)\) for all \(x,y \in \mathcal{X}\). In particular, if we are in state \(x_t\) and consider moving to state \(x^{\ast}\), we know how much more or less likely the new state is relative to the old state through \(p(x^{\ast})/p(x_t)\). For networks, these updates are often incremental and you either set \(x_{t+1}:=x^{\ast}\) or \(x_{t+1}:=x_{t}\), depending on \(p(x^{\ast})/p(x_t)\). Consequently, we sometimes stay in the same state, in the same network, for many iterations.
Burn-in: to make sure that \(x_t\) has ‘forgotten’ the initial state \(x_0\), you usually allow a number of iterations to pass - this is the burn-in period.
Thinning: if you have an adequate burn-in and want to draw many outcomes, you do not have to repeat the burn-in for each draw; instead you allow a certain number of iterations between successive sample points - the number of iterations you wait is called the thinning.
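For \(U \mid X_{\cdot+}=d\), one classic updating rule is the ‘double edge swap’: pick two edges \(\{a,b\}\) and \(\{c,d\}\) on four distinct nodes and propose replacing them with \(\{a,d\}\) and \(\{c,b\}\). The swap leaves every degree unchanged, and because the target distribution is uniform, \(p(x^{\ast})/p(x_t)=1\), so every valid proposal is accepted. A hand-rolled sketch of one such update, for intuition only (in practice, prefer tested implementations):

```r
# One MCMC update for U | degrees: a 'double edge swap'. It preserves all
# degrees, and for a uniform target p(x*)/p(x_t) = 1, so valid proposals
# are always accepted. Invalid proposals leave the state unchanged.
edge.swap <- function(x) {
  e <- which(upper.tri(x) & x == 1, arr.ind = TRUE)  # current edges
  if (nrow(e) < 2) return(x)
  pick <- e[sample(nrow(e), 2), ]                    # two edges {a,b}, {c,d}
  a <- pick[1, 1]; b <- pick[1, 2]; c <- pick[2, 1]; d <- pick[2, 2]
  if (length(unique(c(a, b, c, d))) < 4) return(x)   # need 4 distinct nodes
  if (x[a, d] == 1 || x[c, b] == 1) return(x)        # target slots must be empty
  x[a, b] <- x[b, a] <- 0; x[c, d] <- x[d, c] <- 0   # remove old edges
  x[a, d] <- x[d, a] <- 1; x[c, b] <- x[b, c] <- 1   # add swapped edges
  x
}
```

Iterating this update many times, with a burn-in, moves the chain around the space of graphs with the given degree vector.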
g.udegs <- simulate(padgbus.net~edges,
coef=c(0),
constraints=~degreedist,# this is what does it degreedist is based on padgbus.net
nsim=100,
control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network
To confirm that this network actually has the same degree distribution:
par(mfrow=c(1,3))
deg1 <- degree(padgbus.net, cmode='indegree')
deg2 <- degree(g.udegs[[5]], cmode='indegree')# this is a list of networks, not an array, hence [[k]]
plot( deg1,deg2 ,xlab='Padgett', ylab ='random network',main='degrees')#
plot( deg1[order(deg1)],deg2[order(deg2)] ,ylab ='random network',main='degrees')# the degree distribution is the same but not the same nodes have the same degrees
plot( table(deg1), type='l',col='grey',lwd=4)
lines( table(deg2) ,col='red' )
From the first panel we see that not every node has the same degree in the random network as in the observed network. However, the degree distribution is the same, as seen in the right-hand panel. The middle panel shows that there is a mapping from the nodes in \(x_{\mathrm{obs}}\) to the nodes in \(x\), such that for every node with degree \(k\) in the former, there is exactly one node in the latter that has degree \(k\).
Plot the observed network and then three of the random networks
par( mfrow =c(1,4) )
plot( padgbus.net )
plot( g.udegs[[5]] )
plot( g.udegs[[50]] )
plot( g.udegs[[100]] )
The simulated networks look like pretty faithful representations of the observed network
Since the degree distribution is identical to the observed network, the density will also be preserved for all simulated networks. Let us look at triangles and geodesic distances.
par( mfrow = c(1,3) )
row.index <- which(net.profiles$netnames=='PadgetB')# the row containing the Padgett summaries
triads <- triad.census( g.udegs , mode="graph")[,4]
hist(triads ,
xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference
triads <- triad.census( g.udegs ,mode="graph" )[,3]
hist(triads ,
xlim= range( triads , net.profiles$net.open.triad[ row.index ] ),
main = 'open triads')
abline( v=net.profiles$net.open.triad[ row.index ],col='red' )# the observed density for reference
# because g.udegs is a list it is a little more complicated to use built in sna functions
distances <- matrix(0,10,100)
for (k in c(1:100))
{
distances[,k] <- tabulate( geodist( g.udegs[[k]] )$gdist[upper.tri(geodist( g.udegs[[k]] )$gdist)],
nbins=10 )
}
matplot(c(1:(10)), distances ,
type ='l',
main='geodesic distribution' ,
xlab='distance',
ylab='frequency')
lines(c(1:(10)),
tabulate( geodist( padgbus.net )$gdist[upper.tri(geodist( padgbus.net )$gdist)] , nbins=10 ),
pch=24,bg="blue",lty=1,type='b',
col='red',lwd=3 )
The \(X \thicksim U \mid X_{\cdot+}=d\) model does not fully capture clustering and it over-produces open triangles.
Even if we could model the degree of every node exactly, we cannot produce the same clustering that we see for real networks
The degree distribution alone is not sufficient for explaining clustering and reach - people do not only care about how many friends they have
An introduction to ERGMs is given in the slides. Formally, an ERGM is a model \[ \Pr( X = x \mid \theta) = \frac{ \exp \{ \theta^{\top}z(x) \} }{ \sum_{x' \in \mathcal{X} } \exp \{ \theta^{\top}z(x') \} } \]
where

* \(\theta\) is a vector of statistical parameters
* \(z(x)\) is a vector of statistics calculated on \(x\)
* the denominator is a sum over all graphs, ensuring that the probability sums to one
If \(z(x)=L(x)\), the only model statistic is the number of edges in the graph. This model is equivalent to \(Bern(p)\), where \[ \theta = \log\left( \frac{p}{1-p}\right) \] or, expressed in terms of \(p\), \[ p= \frac{e^{\theta}}{1+e^{\theta}} \]
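The relation between \(\theta\) and \(p\) is just the log-odds (logit) transform. A quick numerical check, using the observed Padgett density \(0.125\), also shows where the value \(\theta \approx -1.945\) used below comes from:

```r
# The edges-only ERGM is Bern(p) with theta = log-odds of a tie.
p <- 0.125                        # the observed Padgett density
theta <- log(p / (1 - p))         # theta = logit(p)
theta                             # approximately -1.9459
p.back <- exp(theta) / (1 + exp(theta))
all.equal(p.back, p)              # recovers p
```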
A property of ERGMs is that the expected statistics \[ E_{\theta} \{ z(X) \} = \sum_{x \in \mathcal{X}} z(x)\Pr( X = x \mid \theta) \] are completely determined by the parameters.
Simulate networks from an ERGM with only the statistic \(z(x)=L(x)\) and parameter \(\theta = −1.945\)
g.ergm.bern <- simulate(padgbus.net~edges,# one statistic
coef= -1.945,# one parameter
nsim=100,
control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network
Inspect networks
par( mfrow = c(1,2) )
hist( gden( g.ergm.bern ) , # the density for each of the 100 random networks
xlim=range( gden( g.ergm.bern ),
gden( padgbus.net ) ) , main='density')
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference
triads <- triad.census( g.ergm.bern , mode='graph' )[,4]
hist(triads ,
xlim= range( triads , triad.census( padgbus.net , mode='graph' )[,4] ),
main = 'closed triads')
abline( v= triad.census( padgbus.net , mode='graph' )[,4],col='red' )# the observed density for reference
The density of the graphs is centred right over the observed value! Because this model is equivalent to \(Bern(p)\), the misfit of closed triads is the same as before.
The Bernoulli ERGM assumes that ties form independently - how can we relax that?
Since the Bernoulli ERGM does not reproduce the number of triangles, how can we improve on that?
If we add statistics for the number of triangles, we can increase the number of triangles that the model produces. Let us add statistics for two-stars, three-stars, and triangles:
We will use the parameters \(\theta = \left( -2.1113 ,1.0465 ,-0.6318, 1.3064 \right)^{\top}\)
g.ergm.markov <- simulate(padgbus.net~edges+kstar(2) +kstar(3) + triangles,# four statistics
coef= c(-4.4878 , 1.2556 , -0.7059 , 1.0266),# the four parameters
nsim=100,
control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network
Inspect networks
par( mfrow = c(2,2) )
hist( gden( g.ergm.markov ) , # the density for each of the 100 random networks
xlim=range( gden( g.ergm.markov ),
gden( padgbus.net ) ) , main='density')
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference
triads <- triad.census( g.ergm.markov , mode='graph' )
obsetriad <- triad.census(padgbus.net, mode='graph' )
hist(triads[,2] ,
xlim= range( triads[,2] , obsetriad[2] ),
main = 'single edge triads')
abline( v= obsetriad[2],col='red' )# the observed density for reference
hist(triads[,3] ,
xlim= range( triads[,3] , obsetriad[3] ),
main = 'open triads')
abline( v= obsetriad[3],col='red' )# the observed density for reference
hist(triads[,4] ,
xlim= range( triads[,4] , obsetriad[4] ),
main = 'closed triads')
abline( v= obsetriad[4],col='red' )# the observed density for reference
The Markov model manages to reproduce all of these local, structural statistics
Let us look at a global property, like geodesic distances
row.index <- which(net.profiles$netnames=='PadgetB')# the row containing the Padgett summaries
# because g.udegs is a list it is a little more complicated to use built in sna functions
distances <- matrix(0,10,100)
for (k in c(1:100))
{
distances[,k] <- tabulate( geodist( g.ergm.markov[[k]] )$gdist[upper.tri(geodist( g.ergm.markov[[k]] )$gdist)],
nbins=10 )
}
matplot(c(1:(10)), distances ,
type ='l',
main='geodesic distribution' ,
xlab='distance',
ylab='frequency')
lines(c(1:(10)),
tabulate( geodist( padgbus.net )$gdist[upper.tri(geodist( padgbus.net )$gdist)] , nbins=10 ),
pch=24,bg="blue",lty=1,type='b',
col='red',lwd=3 )
The Markov model does a pretty spectacular job of also capturing the reach in the network
The Markov model with 4 parameters is sufficient to explain density and local clustering, as well as connectivity
We can interpret this as follows: the degree-based effects - the edges and stars - capture mechanisms to do with popularity and activity, and the triangle parameter captures triadic closure - friends of my friends are also my friends. There are many behavioural interpretations of these mechanisms, but at the end of the day they are sufficient for explaining a lot of the network structure.
Let us try an ERGM for Kapferer’s tailors. This time, we use the statistics edges and so-called alternating triangles. Alternating triangles, or GWESP (geometrically weighted edgewise shared partners), counts the number of shared partners that two directly tied individuals have. Each additional shared partner adds to the statistic, but with a geometrically decreasing weight.
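To fix ideas, the statistic can be computed by hand from an adjacency matrix. The sketch below follows the standard formulation \(e^{\alpha}\sum_k \{1-(1-e^{-\alpha})^k\}\,EP_k\), where \(EP_k\) is the number of edges whose endpoints have exactly \(k\) shared partners; check against ergm’s own gwesp term before relying on the exact parameterisation.

```r
# Geometrically weighted edgewise shared partners from an adjacency matrix.
# For each edge {i,j}, count the common neighbours, then down-weight
# additional shared partners geometrically.
gwesp.stat <- function(x, decay = log(2)) {
  sp <- x %*% x                          # sp[i,j] = no. shared partners of i and j
  esp <- sp[upper.tri(sp) & x == 1]      # shared-partner count for each edge
  exp(decay) * sum(1 - (1 - exp(-decay))^esp)
}
# A triangle: every edge has exactly one shared partner
k3 <- matrix(1, 3, 3) - diag(3)
gwesp.stat(k3)   # 3: the first shared partner of an edge always contributes 1
```

Notice that an edge with no shared partners contributes nothing, the first shared partner contributes 1 regardless of the decay, and each further shared partner adds geometrically less.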
g.ergm.gwesp <- simulate(KAPFTS1.net~edges+gwesp(decay = log(2) , fixed= TRUE),# one statistic
coef= c(-3.795, 1.075),# the two parameters
nsim=100,
control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network
Inspect networks
par( mfrow = c(2,2) )
hist( gden( g.ergm.gwesp ) , # the density for each of the 100 random networks
xlim=range( gden( g.ergm.gwesp ),
gden( KAPFTS1.net ) ) , main='density')
abline( v=gden( KAPFTS1.net ),col='red' )# the observed density for reference
triads <- triad.census( g.ergm.gwesp , mode='graph' )
obsetriad <- triad.census(KAPFTS1.net, mode='graph' )
hist(triads[,2] ,
xlim= range( triads[,2] , obsetriad[2] ),
main = 'single edge triads')
abline( v= obsetriad[2],col='red' )# the observed density for reference
hist(triads[,3] ,
xlim= range( triads[,3] , obsetriad[3] ),
main = 'open triads')
abline( v= obsetriad[3],col='red' )# the observed density for reference
hist(triads[,4] ,
xlim= range( triads[,4] , obsetriad[4] ),
main = 'closed triads')
abline( v= obsetriad[4],col='red' )# the observed density for reference
This two-parameter model seems to capture the local clustering well.
An overall cost for ties (edges) and a tendency for people who have shared friends to also be friends explain most of the network structure.
Based on these experiments:
Explore what other features the random graphs may or may not model
Pick a network, pick some ERGM terms by reference to ergm.terms (use the help function ?ergm.terms).
Try to estimate a model and then simulate from it. How did I get my parameter values? For Padgett I got the coefficients:
# don't run
ans <- ergm(padgbus.net~edges+kstar(2) +kstar(3) + triangles)#
summary(ans)# this produces a table of coefficient estimates
And for Kapferer:
# don't run
ans <- ergm(KAPFTS1.net~edges+gwesp(decay = log(2) , fixed= TRUE))#
summary(ans)# this produces a table of coefficient estimates
Can you pick any statistic arbitrarily to include in the model?
There are at least three reasons the answer to this is no:
For standard surveys, respondents are sampled independently of each other, so we do not get any network data. The one piece of network information we can get from a survey is the number of contacts a respondent has - this is the degree of the respondent.
How much can we say about a network and a person’s network position based only on independent reports?
Survey data may provide egocentric, as opposed to socio-centric, network data. We call this egonetworks (Crossley et al. 2015).
In the US General Social Survey, respondents have been asked to nominate a range of different contacts. Here, we will
The last sections (Fitting a distribution and Measurement error) provide brief introductions to more advanced topics.
We are going to read in a stata file and for that we need the package
library('foreign')
library('haven')# this package is needed to read Stata version 5-12 .dta file
library('labelled')# this is needed to undo all the annoying things haven does
Download the General Social Survey data from the web first time around. Going forward you may want to save it locally. You can download data from https://gssdataexplorer.norc.org/ and read in the downloaded dataset (also the GSS for 1985) but here we will read the data straight from the url:
#GSSdata <- read.dta( 'https://osf.io/hn5b9/download' , convert.factors=FALSE )
GSSdata <- as.data.frame( read_dta( 'https://osf.io/npjv3/download' ), stringsAsFactors=FALSE )
val_labels(GSSdata) <- NULL
For 'read.dta', setting convert.factors=FALSE means that categorical variables are kept as numerical codes.
The dataset is a
class(GSSdata)
## [1] "data.frame"
with dimensions
dim(GSSdata)
## [1] 2812 1180
constituting 2812 cases and 1180 variables. All variables are listed on http://www.thearda.com/Archive/Files/Codebooks/GSS2004_CB.asp
Basic variables include
| Number in list | variable name | name in GSSdata | description |
|---|---|---|---|
| (3) | WRKSTAT | wrkstat | employment status |
| (12) | MARITAL | marital | marital status |
| (24) | PAOCC80 | PAOCC80 | father’s occupation |
| (27) | MAOCC80 | MAOCC80 | mother’s occupation |
| (39) | DEGREE | degree | education |
| (43) | SEX | sex | sex of respondent |
| (33) | AGE | age | age of respondent |
There are many measures of attitudes, for example political views (73 POLVIEW), towards government spending (74-101), race, racism, and ethnicity, religion, etc. In general there are also a lot of questions about third parties, such as the household and household members. Many questions consist of a long battery of targeted attitudes, such as the ‘confidence in’ questions (163-176).
There are a variety of psychological wellbeing measures
| Number in list | variable name | name in GSSdata | description |
|---|---|---|---|
| (157) | HAPPY | happy | A happiness measure |
| (160) | LIFE | life | Excitement with Life |
| (459) | MNTLHLTH | mntlhlth | self-assessed (negative) mental health |
| (960) | HLTH2 | HLTH2 | If underwent counseling for mental or emotional problems |
While there is no specific documentation or listing of scales used in the survey, we recognise a number of questions as items from other scales.
| Number in list | Scale | no. items | reference |
|---|---|---|---|
| (161 - 162) | Trust in People Scale | 3 | Yamagishi (1986) |
| (462-468) | Empathic Concern subscale of interpersonal reactivity index | 7 | M. H. Davis et al. (1980) |
| (515-519) | The Rosenberg Self-Esteem Scale | 5 | Rosenberg (2015) |
| (520-525) | Subscale of Life orientation scale | 6 | Scheier, Carver, and Bridges (1994) |
| (524) | OWNDOING, locus of control | 1 | Rotter (1966) |
| (804-805) | Trust in People Scale | 2 | see e.g. Yamagishi (1986) |
It is worth looking through http://www.thearda.com/Archive/Files/Codebooks/GSS2004_CB.asp searching for terms of interest.
The 2004 (and some earlier surveys) contain a number of network and network-related variables.
| Variable name | Description | Note |
|---|---|---|
| partners | no. sex partners past year | listed as PARTNERS in help file. Vars 931-938 further breakdowns |
| numwomen | no. female sex partners | listed as NUWOMEN in help file |
| nummen | no. male sex partners | listed as NUMMEN in help file |
| PARTNRS5 | no. sex partners in past 5 years | binned |
| numgiven | no. close contacts | truncated at 6+. |
| numcnt | no. people keep in contact with | this is other than family and work |
| 434-437 | mode of contact | keeping in contact with f-t-f, letters, email, etc |
These are summaries of a (population) sexual network and a (population) close-contact network (a number of related network questions address different aspects of social interaction).
Note: the variable HAPPY and the Trust (161 - 162) and EC (462-468) scales were not asked of respondents who provided ‘numgiven’. It is therefore important to check cross-tabulations before doing an analysis; for example, tabulating happy against numgiven demonstrates that there are no respondents with values on both.
table(GSSdata$happy,GSSdata$numgiven,useNA='always')
There were 6 versions of the survey administered and not all of them included all questions, for example
table(GSSdata$version,GSSdata$happy, useNA='always')
table(GSSdata$version,GSSdata$numgiven, useNA='always')
The network of sexual contacts has received a lot of attention in public health; see for example Morris (2004). The number of sexual contacts also serves as a good illustration for modelling the degree distribution of a network from survey data.
The number of sex partners is partially reported in bins, and the missing data code is 99
sexpart <- GSSdata$partners
sexpart[sexpart %in% c(9,989,99)] <- NA # missing data codes
sexpart[sexpart == 5 ] <- 7.5 # middle of range (4,10]
sexpart[sexpart == 6 ] <- 15 # middle of range (10,20]
sexpart[sexpart == 7 ] <- 60.5 # middle of range (20,100]
sexpart[sexpart == 8 ] <- 100 # 100+
The distribution is now
hist(sexpart)
And we can look at difference between male and female
boxplot(sexpart~ GSSdata$sex)
Testing the difference in the number of sex partners between men and women, there is a clear difference
sexreg <- lm(sexpart~ GSSdata$sex==1)
summary(sexreg)
##
## Call:
## lm(formula = sexpart ~ GSSdata$sex == 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.989 -0.989 -0.083 -0.083 98.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0831 0.1589 6.814 1.22e-11 ***
## GSSdata$sex == 1TRUE 0.9055 0.2344 3.863 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.472 on 2192 degrees of freedom
## (618 observations deleted due to missingness)
## Multiple R-squared: 0.006763, Adjusted R-squared: 0.00631
## F-statistic: 14.92 on 1 and 2192 DF, p-value: 0.0001151
An equivalent way of testing this difference is by doing a t-test
t.test( sexpart~ GSSdata$sex==1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: sexpart by GSSdata$sex == 1
## t = -3.8633, df = 2192, p-value = 0.0001151
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -1.3651153 -0.4458451
## sample estimates:
## mean in group FALSE mean in group TRUE
## 1.083122 1.988603
Note that the t-statistics for the regression coefficient and in the t-test are identical in absolute value.
If these are the degree distributions in an undirected network, how come there are such differences? (see also Spiegelhalter 2015).
Because of this discrepancy and the binning, we will focus solely on the number of sexual partners of men in what follows.
We are going to define \(Y_i\) as the number of sex partners of a male \(i\).
The variables \(Y_i\) are the out-degrees of men in the network of sexual partners
We will model \(Y_i\) using
Recode missing values for number of female sex partners and for the variable pre-marital sex
numwom <- GSSdata$nuwomen
numwom[numwom > 993 ] <- NA # missing data codes
is.man <- GSSdata$sex==1
premarsx <- GSSdata$premarsx
premarsx[premarsx > 4] <- NA
We assume \[ Y_i = \beta_0 + \beta_1 x_{i} +\epsilon_i \] for \(x_{i}\) being attitude to pre-marital sex and \(\epsilon_i \overset{i.i.d}{\thicksim}N(0,\sigma^2)\).
The OLS estimates are given by
y <- numwom[ is.man ]# this for plotting reasons below
x <- premarsx[is.man ]# ditto
not.use <- is.na(y) | is.na(x)
y <- y[not.use==FALSE]
x <- x[not.use==FALSE]
nuwom.reg <- lm( y ~ x)
summary( nuwom.reg )
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.21 -37.21 -22.21 -5.63 956.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.901 18.242 -0.159 0.8737
## x 12.529 5.548 2.258 0.0246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121.6 on 337 degrees of freedom
## Multiple R-squared: 0.01491, Adjusted R-squared: 0.01198
## F-statistic: 5.1 on 1 and 337 DF, p-value: 0.02457
Check the residuals to see if the normality assumption appears satisfied
par(mfrow=c(1,2))
hist(nuwom.reg$residuals)
plot(x=predict(nuwom.reg), y= y,
xlab='Predicted Values',
ylab='Actual Values',
main='Predicted vs. Actual Values')
abline(a=0, b=1)
The residuals are clearly highly skewed (and not normal). The logarithm of the outcome might look different, but we can only take the log of strictly positive values.
Outcome variable needs to be strictly positive
is.pos <- y>0
Letting \(Z_i=\log(Y_i)\), we estimate \[ Z_i = \alpha +\gamma x_{i} +\xi_i \]
log.nuwom.reg <- lm( log(y[is.pos]) ~ x[is.pos])
summary(log.nuwom.reg)
##
## Call:
## lm(formula = log(y[is.pos]) ~ x[is.pos])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4972 -1.3986 -0.1946 1.0435 4.7534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08942 0.24140 4.513 9.04e-06 ***
## x[is.pos] 0.35195 0.07288 4.829 2.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.522 on 315 degrees of freedom
## Multiple R-squared: 0.06894, Adjusted R-squared: 0.06598
## F-statistic: 23.32 on 1 and 315 DF, p-value: 2.141e-06
par(mfrow=c(1,2))
hist(log.nuwom.reg$residuals)
plot(x=predict(log.nuwom.reg), y= log(y[is.pos]),
xlab='Predicted Values',
ylab='Actual Values',
main='Predicted vs. Actual Values')
abline(a=0, b=1)
The orginal degrees \(Y_i\) are clearly counts and the Poisson distribution may be a better model for counts \[ E(Y_i \mid x_i ) = \lambda_i,\;\Pr(Y_i=h\mid x_i)=e^{-\lambda_i}\frac{\lambda_i^h}{h!},{\text{ and }}\log(\lambda_i)=\eta_i=\beta_0+\beta_1x_i \]
We estimate \(\beta_0\) and \(\beta_1\) using 'glm'
reg.po <- glm( y ~ x, family = poisson(link='log') )
summary( reg.po )
##
## Call:
## glm(formula = y ~ x, family = poisson(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.830 -7.187 -4.620 -1.379 70.675
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.00319 0.03905 51.3 <2e-16 ***
## x 0.46863 0.01065 44.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 38576 on 338 degrees of freedom
## Residual deviance: 36062 on 337 degrees of freedom
## AIC: 37354
##
## Number of Fisher Scoring iterations: 7
In addition to standard statistical distributions, a number of models for networks have been proposed that give rise to distributions for the degrees. One such model is the mathematical mechanism of preferential attachment, which is claimed to give rise to a degree distribution that follows a power-law (Barabási and Albert 1999).
Liljeros et al. (2001) make very strong claims about the distribution of the number of sex partners - the take-home point is: is degree the only determinant of connectivity and spread?
\[ \Pr(Y=k) \propto k^{-\alpha} {\text{ , for }} k > k_c \]
Drawing on epidemiology, it has been suggested that any disease spreading on the network becomes epidemic if \(\alpha \leq 3\).
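One way to see why the value \(3\) matters: for a power-law degree distribution, the second moment involves \(\sum_k k^2 \cdot k^{-\alpha} = \sum_k k^{2-\alpha}\), which diverges when \(\alpha \leq 3\), so the degree variance is unbounded. A numerical sketch of the partial sums makes this visible:

```r
# Partial sums of k^(2 - alpha): divergent for alpha <= 3, convergent above.
second.moment <- function(alpha, kmax) sum((1:kmax)^(2 - alpha))
# alpha = 2.5: the partial sums keep growing as the cutoff increases
sapply(c(1e2, 1e4, 1e6), function(kmax) second.moment(2.5, kmax))
# alpha = 3.5: the partial sums settle down to a finite limit
sapply(c(1e2, 1e4, 1e6), function(kmax) second.moment(3.5, kmax))
```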
In their paper, Barabási and Albert (1999) claim that the power-law form proves that the only mechanism at play in networks is preferential attachment. This is wrong, both because it is not the only mechanism and because there is no one-to-one correspondence between the power-law and preferential attachment (Robins G. L., Pattison, and Koskinen 2008).
The package 'poweRlaw' can be used to fit different functional forms to the degree distribution. To use it the first time, install it.
install.packages('poweRlaw')
Load the package
library('poweRlaw')
The power-law does not straightforwardly allow explanatory variables, so let us simply re-estimate the log-normal and Poisson without predictors for comparison. Note that the power-law cannot handle zeros, so now we estimate the degree distributions conditional on \(Y_i>0\)
m_pl = displ$new( y[is.pos] ) # fit power-law
est = estimate_xmin(m_pl)# estimate the parameters and the cutoff
m_pl$setXmin(est)# set the minimum value
m_ln = dislnorm$new( y[is.pos] ) # fit lognormal dist
est_ln = estimate_xmin(m_ln) # estimate the parameters and the cutoff
m_ln$setXmin(est_ln)
m_pois = dispois$new( y[is.pos] ) # poisson
m_pois$setXmin(1)
est_pois = estimate_pars(m_pois)
m_pois$setPars(est_pois)
Note that estimate_xmin estimates the value of the degree \(k_c\) from which the power-law holds.
The package allows us to plot the fitted distributions, or more specifically the ‘survival functions’ \(\Pr(Y > k)\), on a log-log scale.
plot(m_pl)# plots the raw data
lines(m_pl, col='red') # plots the power law
lines(m_ln, col='blue') # plots the lognormal
lines(m_pois, col='green') # plots the poisson
What distribution seems to fit the number of female sex partners best? Two things to note: the Poisson has been forced to ignore 0, and the power-law has ignored all the information below \(k_c\)
Fischer (2009), Bailey and Marsden (1999) and others (e.g. Marsden 2003; Putnam 2001; McPherson, Smith-Lovin, and Brashears 2006) have used various waves of the GSS to investigate ‘the size’ of people’s networks. The size of people’s networks has been the target of intense speculation since Robin Dunbar (1992) proposed that there is an evolutionary limit on the number of close contacts that humans (or primates?) can retain in their brain.
The key outcome variable in these studies has been the number of named core discussion partners
table( GSSdata$numgiven , useNA= 'always')
##
## 0 1 2 3 4 5 6 9 <NA>
## 397 281 263 232 128 96 70 5 1340
The value 9 is a missing data code, and recall that the number of named contacts is truncated at 6 - a respondent who nominated 6 or more has numgiven coded as 6 but was only allowed to provide 5 names.
As for a sociocentric network, we can plot the (truncated) degree distribution.
GSSdata$numgiven[GSSdata$numgiven>6] <- NA
plot( table( GSSdata$numgiven ) )
Without knowing the structure of the population network, we can still estimate the degree distribution.
As we did for sexual partners, we may try to explain the number of close contacts using statistical tests. While both sexual partners and close contacts have skewed degree distributions, there is one notable difference in the GSS, namely the range of values: close contacts range from 0 to 6, while sexual partners range from 0 to 900 or so. This limits the use of heavy-tailed distributions such as the power-law.
Close contacts might act as social support and alleviate mental health issues (Bryant et al. 2017). Can we test whether better mental health is associated with having more social support?
Create a binary variable indicating whether a respondent has experienced 1 or more days of mental ill health in the past year.
GSSdata$mntlhlth[ GSSdata$mntlhlth > 30 ] <- NA # set as missing
illhealth <- GSSdata$mntlhlth>0 # this will indicate one day or more
Again, assuming random normal errors
reg.close.lm <- lm(GSSdata$numgiven ~ illhealth )
summary( reg.close.lm )
##
## Call:
## lm(formula = GSSdata$numgiven ~ illhealth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2925 -1.8109 -0.2925 1.1891 4.1891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.81087 0.08684 20.854 < 2e-16 ***
## illhealthTRUE 0.48167 0.11679 4.124 4.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.786 on 944 degrees of freedom
## (1866 observations deleted due to missingness)
## Multiple R-squared: 0.0177, Adjusted R-squared: 0.01666
## F-statistic: 17.01 on 1 and 944 DF, p-value: 4.047e-05
Using the log-link function
reg.close.po <- glm(GSSdata$numgiven ~ illhealth, family = poisson(link='log') )
summary( reg.close.po )
##
## Call:
## glm(formula = GSSdata$numgiven ~ illhealth, family = poisson(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1413 -1.9031 -0.1976 0.8066 2.4489
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.59381 0.03613 16.435 < 2e-16 ***
## illhealthTRUE 0.23585 0.04625 5.099 3.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1750.4 on 945 degrees of freedom
## Residual deviance: 1724.0 on 944 degrees of freedom
## (1866 observations deleted due to missingness)
## AIC: 3687.2
##
## Number of Fisher Scoring iterations: 5
par(mfrow=c(1,2))
plot(x=predict(reg.close.lm), y= GSSdata$numgiven[!is.na(GSSdata$numgiven) & !is.na(illhealth)],
xlab='Predicted Values',
ylab='Actual Values',
main='OLS Predicted vs. Actual Values')
abline(a=0, b=1)
plot(x=reg.close.po$fitted.values, y= GSSdata$numgiven[!is.na(GSSdata$numgiven) & !is.na(illhealth)],
xlab='Predicted Values',
ylab='Actual Values',
main='Po Predicted vs. Actual Values')
abline(a=0, b=1)
The predicted values are just \(\hat{\beta}_0\) and \(\hat{\beta}_0+\hat{\beta}_1\) for the OLS, and \(e^{\hat{\beta}_0}\) and \(e^{\hat{\beta}_0+\hat{\beta}_1}\) for the Poisson. Since the Poisson does not predict negative values, its fit is better.
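As a check, the four fitted values can be reproduced directly from the estimated coefficients (a sketch, assuming the models reg.close.lm and reg.close.po fitted above):

```r
# OLS fitted values: beta0 (healthy) and beta0 + beta1 (ill)
b <- coef(reg.close.lm)
c(healthy = unname(b[1]), ill = unname(b[1] + b[2]))
# Poisson fitted values: exp(beta0) and exp(beta0 + beta1)
g <- coef(reg.close.po)
c(healthy = unname(exp(g[1])), ill = unname(exp(g[1] + g[2])))
```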
par(mfrow=c(2,1))
plot(table(GSSdata$numgiven[illhealth==0]),ylab='',main='Degree distribution healthy')
abline(v=min(reg.close.po$fitted.values),col='red')
lines(mean(GSSdata$numgiven[illhealth==0],na.rm=TRUE),0,pch=15,type='p')
plot(table(GSSdata$numgiven[illhealth==1]),ylab='',main='Degree distribution long term un-healthy')
abline(v=max(reg.close.po$fitted.values),col='red')
lines(mean(GSSdata$numgiven[illhealth==1],na.rm=TRUE),0,pch=15,type='p')
All the usual statistical tests are available in R, for example ANOVA: https://www.statmethods.net/stats/anova.html.
In addition to naming alters, each ego has also provided some basic information about each of their named alters.
Ego-alter tie
Here we will use the GSS 2004. The number of named alters per person is given by the variable numgiven, which is the number of close contacts, truncated at 6+. For each of these named contacts, ego provides descriptions in the form of
| Variable name | Description | Note |
|---|---|---|
| SEX1 | sex of first named alter | this is NA if less than one named |
| \(\vdots\) | ||
| SEX5 | sex of fifth named alter | this is NA if less than five named |
| RACE1 | race of first named alter | this is NA if less than one named |
| \(\vdots\) | ||
| RACE5 | race of fifth named alter | this is NA if less than five named |
| TALKTO1 | talks to first named alter about important matters | this is NA if less than one named |
| \(\vdots\) | ||
| TALKTO5 | talks to fifth named alter about important matters | this is NA if less than five named |
| EDUC1 | education of first named alter | this is NA if less than one named |
| \(\vdots\) | ||
| EDUC5 | education of fifth named alter | this is NA if less than five named |
There are also a host of questions about common group membership (e.g. GRPBOTH1 through GRPBOTH5) and religion, etc.
Attributes for named alters are reported in the wide format: for each respondent values are reported in ascending order by alter id, variable by variable.
This respondent has not named any alters, so there are no alter properties reported
GSSdata[1,261:270]
This respondent has named exactly one alter, so there are only variable values for ‘SEX1’, ‘RACE1’, etc.
GSSdata[11,261:270]
GSSdata$SEX1[11]
Respondent has named exactly three alters, so there are three reports for each of the variables ‘SEX1’, ‘RACE1’, etc.
GSSdata[13,261:270]
These 3 alters are all female
GSSdata[13,271:275]
Respondent has named exactly four alters, so there are four alter values to report on each variable
GSSdata[16,271:275]
GSSdata[16,301:305]
GSSdata[16,306:310]
For egos 13, 16 and 42, the age of their alters is given in
GSSdata[c(13,16,42),c(346:350)]
We can thus get the average age in each egonet using
GSSdata$numgiven[GSSdata$numgiven>6] <- NA
ave.age <- rowSums(GSSdata[,c(346:350)], na.rm=TRUE)/GSSdata$numgiven
Compare the age of ego with the average age of their alters:
plot( GSSdata$age[GSSdata$numgiven %in% c(1:5)] , ave.age[GSSdata$numgiven %in% c(1:5)])
Homophily (McPherson, Smith-Lovin, and Cook 2001) is the process of people choosing contacts that are like themselves. If there is homophily on age, we would expect the average age of alters to be predicted by the age of ego.
reg.age <- lm( ave.age[GSSdata$numgiven %in% c(1:5)] ~ GSSdata$age[GSSdata$numgiven %in% c(1:5)] )
summary(reg.age)
##
## Call:
## lm(formula = ave.age[GSSdata$numgiven %in% c(1:5)] ~ GSSdata$age[GSSdata$numgiven %in%
## c(1:5)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.966 -7.673 -0.745 5.726 60.715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.91781 1.06635 22.43 <2e-16
## GSSdata$age[GSSdata$numgiven %in% c(1:5)] 0.51310 0.02212 23.19 <2e-16
##
## (Intercept) ***
## GSSdata$age[GSSdata$numgiven %in% c(1:5)] ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.59 on 998 degrees of freedom
## Multiple R-squared: 0.3502, Adjusted R-squared: 0.3496
## F-statistic: 537.9 on 1 and 998 DF, p-value: < 2.2e-16
We can also investigate if there is homophily on race. Race is coded differently for respondents and alters so we need to recode
ego.white <- GSSdata$race==1
alter.white <- GSSdata[,c(276, 277 , 278 , 279 , 280)]==4
prop.alter.white <- rowSums(alter.white, na.rm=TRUE)/GSSdata$numgiven
boxplot(prop.alter.white ~ ego.white )
There is clear evidence of homophily on race.
In terms of access to resources, it may be important to have access to different types of resources. For each ego we can tabulate the number of alters in each category
alter.education <- cbind(
rowSums(GSSdata[,c(341:345)]==0, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==1, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==2, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==3, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==4, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==5, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==6, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==7, na.rm=TRUE),
rowSums(GSSdata[,c(341:345)]==8, na.rm=TRUE))
Based on this we can calculate the proportions \(p_j\) of alters in category \(j\).
alter.education <- alter.education / rowSums(alter.education)# convert to proportions
Simpson’s diversity index \[ H = \sum_{j=1}^R p_j^2 \] (Simpson 1949; see the discussion of similarity indices for networks in Stys et al. 2020)
diss.index <- rowSums(alter.education^2,na.rm=TRUE)
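A quick sanity check of the index on two stylised egonets (toy proportions, not the GSS data):

```r
# With R = 5 education categories:
p_spread <- rep(1/5, 5)    # alters evenly spread over the categories
sum(p_spread^2)            # 0.2 - the minimum for five categories
p_same <- c(1, 0, 0, 0, 0) # all alters in the same category
sum(p_same^2)              # 1 - maximal concentration
```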
Long format means that we code each ego-alter pair as a separate observation. Manually, we can create a new data frame where the alter-id is implied by ‘_1’, ‘_2’, etc.
d1 <- data.frame(id = GSSdata$id,
egosex = GSSdata$sex,
egoage = GSSdata$age,
sex_1 = GSSdata$SEX1,
sex_2 = GSSdata$SEX2,
sex_3 = GSSdata$SEX3,
sex_4 = GSSdata$SEX4,
sex_5 = GSSdata$SEX5,
talkto_1 = GSSdata$TALKTO1,
talkto_2 = GSSdata$TALKTO2,
talkto_3 = GSSdata$TALKTO3,
talkto_4 = GSSdata$TALKTO4,
talkto_5 = GSSdata$TALKTO5,
age_1 = GSSdata$AGE1,
age_2 = GSSdata$AGE2,
age_3 = GSSdata$AGE3,
age_4 = GSSdata$AGE4,
age_5 = GSSdata$AGE5,
cowork_1 = GSSdata$COWORK1,
cowork_2 = GSSdata$COWORK2,
cowork_3 = GSSdata$COWORK3,
cowork_4 = GSSdata$COWORK4,
cowork_5 = GSSdata$COWORK5
)
This is still wide-format, but it is a data frame that only has the subset of variables that we chose. To convert it to ‘long format’, we use the reshape function
egoalter <- reshape(d1, # our wide-format data frame
dir = "long", # indicate that we want to transform into 'long' format
varying = 4:23, # variables 4 through 23 vary across alters
sep = "_") # the character used to indicate which alter a variable belongs to
# Note: this reshape call does not work on 'haven'-formatted data with value labels
Have a look at the new, long-format data frame
head(egoalter)
## id egosex egoage time sex talkto age cowork
## 502.1 502 2 30 1 NA NA NA NA
## 578.1 578 1 32 1 NA NA NA NA
## 600.1 600 2 41 1 NA NA NA NA
## 863.1 863 2 39 1 NA NA NA NA
## 951.1 951 1 61 1 NA NA NA NA
## 991.1 991 1 35 1 NA NA NA NA
Recode talkto to a binary variable capturing ‘almost daily’
egoalter$talkto[egoalter$talkto > 4] <- NA
egoalter$talkto[egoalter$talkto>1] <- 0
Consider now modelling talking almost daily as a binary variable \(Y_{ij}\), where \(i\) is ego and \(j\) is alter \[ \mathrm{logit} (\Pr(Y_{ij}=1|z_i,x_j))=\alpha+\beta z_i +\gamma x_j + \nu_i, \] where \(\nu_i \thicksim N(0,\tau^2)\) is a random effect for ego \(i\) that is shared across all alters of \(i\).
To do this using a logistic link function we need the package lme4.
library(lme4)
reg.talkto <- glmer(talkto ~ egoage+egosex+sex+age+cowork+(1| id), # regression formula
data= egoalter, # name of the data frame that holds the variables
family= binomial)
summary( reg.talkto )
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: talkto ~ egoage + egosex + sex + age + cowork + (1 | id)
## Data: egoalter
##
## AIC BIC logLik deviance df.resid
## 3759.1 3800.7 -1872.5 3745.1 2829
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0978 -0.8585 0.5168 0.7883 1.7916
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.6548 0.8092
## Number of obs: 2836, groups: id, 1064
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.783526 0.350290 5.092 3.55e-07 ***
## egoage -0.015355 0.003378 -4.545 5.50e-06 ***
## egosex 0.358715 0.101995 3.517 0.000436 ***
## sex 0.211911 0.088154 2.404 0.016223 *
## age -0.009880 0.003076 -3.212 0.001318 **
## cowork -0.692913 0.136942 -5.060 4.19e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) egoage egosex sex age
## egoage -0.299
## egosex -0.357 0.047
## sex -0.274 0.002 -0.125
## age -0.194 -0.424 -0.022 -0.015
## cowork -0.668 0.007 -0.086 -0.068 -0.024
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00246418 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
We are looking at the s50 dataset, which is further described here: https://www.stats.ox.ac.uk/~snijders/siena/s50_data.htm
This dataset is available in zipped format online.
temp <- tempfile()
download.file("https://www.stats.ox.ac.uk/~snijders/siena/s50_data.zip",temp)
X <- as.matrix( read.table(unz(temp, "s50-network1.dat")) )
sport <- read.table(unz(temp, "s50-sport.dat"))
smoke <- read.table(unz(temp, "s50-smoke.dat"))
alcohol <- read.table(unz(temp, "s50-alcohol.dat"))
unlink(temp)
n <- nrow(X)
smoke <- (smoke[,2] %in% c(2,3))+0 # use wave 2 and set values of 2 and 3 to smoking and 1 to non-smoking
sport <- sport[,1]-1# dichotomise sporty
The variable Alcohol use is coded as
| Alcohol value | meaning |
|---|---|
| 1 | non |
| 2 | once or twice a year |
| 3 | once a month |
| 4 | once a week |
| 5 | more than once a week |
We are now going to investigate whether your friends’ alcohol use tends to influence your alcohol use. Start by plotting the network:
gplot(X,vertex.cex=alcohol[,2]/2, vertex.col = ifelse(smoke==1,'red','green') )
Looking at the figure, do you see any evidence of social influence?
Big nodes seem to be tied to other big nodes and small nodes to other small nodes. Smokers also seem to hang out with other smokers.
Make the assumption that the levels of alcohol use can be treated as an interval-level, continuous variable. To model the outcomes, we start by assuming that outcomes are independent across students, using a regression model
\[ Y_i = \beta_0+\beta_1 m_{i,smoke} + \beta_2 m_{i,sport}+\epsilon_i \]
where

- \(\beta_0\) is the intercept
- \(\beta_1\) is the average difference in alcohol use for smokers relative to non-smokers
- \(\beta_2\) is the average difference in alcohol use for sporty people relative to non-sporty people
and the \(\epsilon_i\)’s are assumed to be independent across \(i\) and follow a normal distribution \(N(0,\sigma^2)\).
y <- alcohol[,2]
ans.ols <- lm(y ~ smoke+sport)
summary(ans.ols)
##
## Call:
## lm(formula = y ~ smoke + sport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0922 -0.6918 0.1690 0.8232 2.5694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4306 0.3169 7.669 7.98e-10 ***
## smoke 1.4004 0.3084 4.541 3.90e-05 ***
## sport 0.2612 0.3331 0.784 0.437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.021 on 47 degrees of freedom
## Multiple R-squared: 0.305, Adjusted R-squared: 0.2755
## F-statistic: 10.31 on 2 and 47 DF, p-value: 0.0001933
What conclusions can you draw from the ANOVA table of the OLS regression regarding the regressors smoke and sport?
Smokers have on average a 1.4 higher score on drinking and the coefficient is significantly different from 0 at the 1%-level (and more). There is no evidence of sporty people drinking more or less than less sporty people as the coefficient is not significantly different from 0.
We assumed that the errors were normally distributed so the residuals
\[ \hat{e}_i= y_i - \hat{y}_i= y_i - \hat{\beta}M_i^{\top} \]
should be normally distributed:
hist(ans.ols$residuals)
The residuals seem reasonably ‘normal’
We are now going to investigate other properties of the model.
For the sna package, we need to put all covariates, including the constant, in a matrix
\[ M = \begin{pmatrix} M_{1}\\ M_{2}\\ \vdots\\ M_{n}\\ \end{pmatrix}= \begin{pmatrix} 1 & m_{1,1} & \cdots & m_{1,p}\\ 1 & m_{2,1} & \cdots & m_{2,p}\\ \vdots & \vdots & \vdots& \vdots\\ 1 & m_{n,1} & \cdots & m_{n,p}\\ \end{pmatrix} \]
We thus put all covariates into the same matrix:
M <- cbind(matrix(1,n,1),smoke,sport)
colnames(M) <- c('cons','smoke','sport')
head(M)
## cons smoke sport
## [1,] 1 0 1
## [2,] 1 1 0
## [3,] 1 0 0
## [4,] 1 0 1
## [5,] 1 0 1
## [6,] 1 0 1
Let us investigate whether the residuals are independent across the network. In particular, if the outcome of \(i\) is completely independent of the outcome of \(j\), then we would not expect a difference between \((\hat{e}_i - \bar{e})(\hat{e}_j-\bar{e})\) for a pair connected by a tie, \(x_{ij}=1\), and \((\hat{e}_i - \bar{e})(\hat{e}_k-\bar{e})\) for a pair that is not connected, \(x_{ik}=0\). In other words, higher (lower) than predicted values should not be expected simply because a network partner has a higher (lower) than predicted value. Note that the average \(\bar{e}=0\) by construction and is included only for clarity. Moran’s I measures exactly whether connected partners tend to have more similar residuals than unconnected pairs. In terms of the outcome variable
\[ I_k =\frac{n}{\sum_{i=1}^n\sum_{j=1}^n X_{ij}^{(k)}} \frac{\sum_{i=1}^n \sum_{j=1}^n (y_i-\bar{y}) (y_j-\bar{y})X_{ij}^{(k)} }{\sum_{j=1}^n (y_j-\bar{y})^2} \]
where \(X_{ij}^{(k)}\) is the \(k\)-step adjacency matrix, i.e. \(X_{ij}^{(1)}\) is the matrix of people that are directly connected, \(X_{ij}^{(2)}\) is the matrix of people that are connected at distance 2, etc. The step \(k\) is sometimes called the lag. Intuitively, if \(X_{ij}^{(k)}\) doesn’t matter, then many cross-product terms should cancel each other out. It can be shown that, under the assumption that there is no network correlation at lag \(k=1\), the expected value is
\[ E(I_1) = \frac{-1}{n-1} \]
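The \(k\)-step matrices \(X^{(k)}\) can be recovered from geodesic distances. A sketch using sna::geodist on the adjacency matrix X from above:

```r
# X^(k) indicates pairs at geodesic distance exactly k
gd <- geodist(X)$gdist   # matrix of shortest path lengths between all pairs
X2 <- (gd == 2) + 0      # indicator matrix of pairs connected at distance 2
```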
Here we want to check if this seems to hold true for our residuals - are they uncorrelated across network partners?
nacf(X,ans.ols$residuals,type="moran")
## 0 1 2 3 4 5
## 1.00000000 0.19495562 -0.17048356 0.02784058 -0.04684902 -0.01678792
## 6 7 8 9 10 11
## 0.03002702 -0.06753150 -0.10455140 0.00000000 0.00000000 0.00000000
## 12 13 14 15 16 17
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 18 19 20 21 22 23
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 24 25 26 27 28 29
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 30 31 32 33 34 35
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 36 37 38 39 40 41
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 42 43 44 45 46 47
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## 48 49
## 0.00000000 0.00000000
# nacf(X,ans.ols$residuals,type="geary") # Geary's C is another measure of autocorrelation but is harder to understand
That \(I_0=1\) is because this is the correlation of residuals with themselves!
We are only really interested in shorter lags, so let us plot the first 4 and add a reference line for \(E(I_1)\) under no correlation
plot(nacf(X,ans.ols$residuals,type="moran")[1:4],type='b')
abline(h=-1/(n-1),col='red')
When modelling social influence we may want to scale the network ties to take into account how many ties people have. Leenders (2002) proposes a number of ways in which we can scale the adjacency matrix. Here we want each arc to have the weight
\[ W_{ij} = X_{ij}/d_{i,out} \]
We can do this easily BUT we have to be careful so that we do not divide by 0 (we define \(0/0\) as \(0\))
degrees <- degree(X, cmode='outdegree')
W <- X
W[degrees>0,] <- X[degrees>0,]/degrees[degrees>0]
# You can check that we are actually normalising correctly
# Wcopy <- X
# for (i in c(1:n))
# {
# if (degrees[i]>0)
# {
# Wcopy[i,] <- X[i,]/degrees[i]
# }
#
#}
#sum( W != Wcopy)
Check the residuals again
plot(nacf(W,ans.ols$residuals,type="moran")[1:4],type='b')
Do we see any difference? In the formula for \(I_k\) the scaling factors \(d_i\) will actually cancel out.
The network autocorrelation model, or network disturbances model, looks exactly like the OLS \[ y_i = M_i \beta + \epsilon_i\text{,} \]
but we no longer assume that the residuals are independent. Instead, we induce network autocorrelation on the error terms
\[ \epsilon_i = \rho \sum_{j=1}^n W_{ij}\epsilon_j+\xi_i \]
(\(\xi\) is the Greek letter xi - https://en.wikipedia.org/wiki/Xi_(letter)). The error terms \(\xi_i\) are assumed to be independent and identically distributed \(\xi_i \thicksim N(0,\sigma^2)\). The interpretation is that if \(i\) has a higher than predicted value on the outcome variable, then \(j\) is also more likely to have a higher than predicted value, for all \(j\) that \(i\) has nominated.
If you know a bit of matrix algebra, you will notice that we can write the vector of disturbances \(\epsilon = (\epsilon_1,\ldots,\epsilon_n)^{\top}\) as
\[ \epsilon = \rho W \epsilon + \xi \]
One immediate issue here is that we have \(\epsilon\) on both sides of the equation. We can simplify this expression by solving for \(\epsilon\)
\[ \epsilon = (I-\rho W)^{-1}\xi \]
You can interpret this as the error terms \(\xi_i\) ‘spreading’ on the network.
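This ‘spreading’ can be illustrated on a small toy network (a sketch with simulated weights, not the s50 data):

```r
# Toy illustration of epsilon = (I - rho W)^{-1} xi
set.seed(1)
n.toy <- 10
A <- matrix(rbinom(n.toy^2, 1, 0.3), n.toy, n.toy) # random adjacency matrix
diag(A) <- 0
rs <- rowSums(A)
W.toy <- A
W.toy[rs > 0, ] <- A[rs > 0, ] / rs[rs > 0]        # row-normalise (0/0 defined as 0)
xi  <- rnorm(n.toy)                                # iid shocks
eps <- solve(diag(n.toy) - 0.4 * W.toy) %*% xi     # autocorrelated disturbances
cbind(xi, eps)  # each eps mixes a node's own shock with its neighbours' shocks
```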
The function lnam (which stands for linear network autocorrelation models, presumably) can fit this model. The formula is specified in terms of the outcome variable y and the matrix x of covariates. The weight matrix is specified as W2.
netacm.1 <-lnam(y=y, x=M, W2=X)
## Loading required namespace: numDeriv
summary(netacm.1)
##
## Call:
## lnam(y = y, x = M, W2 = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6676 -0.6647 0.3353 1.1311 2.5269
##
## Coefficients:
## Estimate Std. Error Z value Pr(>|z|)
## cons 2.47307 0.33278 7.431 1.07e-13 ***
## smoke 1.00287 0.35437 2.830 0.00465 **
## sport 0.19162 0.30839 0.621 0.53438
## rho2.1 0.16432 0.07098 2.315 0.02061 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimate Std. Error
## Sigma 0.9279 0.009
##
## Goodness-of-Fit:
## Residual standard error: 1.061 on 46 degrees of freedom (w/o Sigma)
## Multiple R-Squared: 0.1757, Adjusted R-Squared: 0.104
## Model log likelihood: -68.42 on 45 degrees of freedom (w/Sigma)
## AIC: 146.8 BIC: 156.4
##
## Null model: meanstd
## Null log likelihood: -79.54 on 48 degrees of freedom
## AIC: 163.1 BIC: 166.9
## AIC difference (model versus null): 16.24
## Heuristic Log Bayes Factor (model versus null): 10.51
Did the conclusions about the regressors smoke and sport change?
The difference between smokers and non-smokers remains, but the effect is somewhat smaller.
Is there evidence of network autocorrelation?
The network autocorrelation parameter \(\rho\) is here rho2.1; it is estimated at 0.16 and is statistically significantly different from zero at the 5% level.
Looking at the network plot, it almost looks as if high-degree nodes drink more. To take this into account, we add outdegree as an explanatory variable:
M <- cbind(matrix(1,n,1),degrees,smoke,sport )
colnames(M) <- c('cons','degree','smoke','sport')
head(M)
## cons degree smoke sport
## [1,] 1 3 0 1
## [2,] 1 4 1 0
## [3,] 1 4 0 0
## [4,] 1 4 0 1
## [5,] 1 2 0 1
## [6,] 1 2 0 1
Now re-run the network autocorrelation model (call the output netacm.2 to align with the code in the code chunk below).
netacm.2 <-lnam(y=y, x=M, W2=W)
summary(netacm.2)
##
## Call:
## lnam(y = y, x = M, W2 = W)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91118 -0.71108 0.09031 0.60982 2.34177
##
## Coefficients:
## Estimate Std. Error Z value Pr(>|z|)
## cons 2.12466 0.39409 5.391 7e-08 ***
## degree 0.08893 0.05627 1.580 0.113994
## smoke 1.07210 0.31516 3.402 0.000669 ***
## sport 0.25294 0.29605 0.854 0.392900
## rho2.1 0.66227 0.29190 2.269 0.023278 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimate Std. Error
## Sigma 0.9115 0.009
##
## Goodness-of-Fit:
## Residual standard error: 1.032 on 45 degrees of freedom (w/o Sigma)
## Multiple R-Squared: 0.2479, Adjusted R-Squared: 0.1643
## Model log likelihood: -67.27 on 44 degrees of freedom (w/Sigma)
## AIC: 146.5 BIC: 158
##
## Null model: meanstd
## Null log likelihood: -79.54 on 48 degrees of freedom
## AIC: 163.1 BIC: 166.9
## AIC difference (model versus null): 16.55
## Heuristic Log Bayes Factor (model versus null): 8.902
Does accounting for outdegree change the conclusions about network autocorrelation?
No; the coefficient for degree itself is not statistically significantly different from 0. The magnitude of the network autocorrelation parameter increases substantially.
The network effects model assumes that there is correlation through the outcome variable \(Y\). This is subtly different from correlation only through the error term. The outcome \(Y_i\) for \(i\) is a weighted sum of the values \(Y_j\) of the contacts of \(i\)
\[ Y_i = \rho \sum_{j=1}^n W_{ij}Y_j + M_i\beta+\epsilon_i \]
where \(\epsilon_i \thicksim N(0,\sigma^2)\) independently for all \(i\).
If you want to model social influence or social contagion, the network effects model is more appropriate than the autocorrelation model, as the former models the dependence between values of the outcome variable.
The equation for the full outcome vector can, as before, be written compactly as
\[ Y = \rho W Y + M\beta+\epsilon \]
Here there is feedback on the outcome variable itself.
We can derive this model out of a longitudinal model
\[ Y_{i,t} = \rho \sum_{j=1}^n W_{ij}Y_{j,t-1}+M\beta+\epsilon_{i,t} \]
as \(t\) tends to infinity.
For a quick illustration, let us look at the first five iterations of the updating according to the longitudinal form of the model
rho <- 0.4 # set a strong influence parameter
beta <- matrix(summary(netacm.2)$beta[,1],4,1)# use the estimated coefficients from the last regression
Y <- matrix(rnorm(n),n,1)# random starting point
par(mfrow=c(2,3), oma = c(0,1,0,0) + 0.1,mar = c(1,0,1,1) + 0.1)
coord <- gplot(X,gmode='graph',vertex.cex = Y/2,main= paste('M ',round(nacf(W,Y[,1],type="moran")[2],3) ) )
for (k in c(1:5)){
Y <- rho * W %*% Y + M %*% beta + matrix(rnorm(n),n,1)# update according to equation; %*% for matrix multiplication
gplot(X,gmode='graph',vertex.cex = Y/2, coord=coord, main= paste('M ',round(nacf(W,Y[,1],type="moran")[2],3) ) )
}
We may or may not see increased correlation in the outcomes. Running 100 iterations makes this a little more evident
moran <- matrix(0,100,3)
rho <- .4
Y <- matrix(rnorm(n),n,1)
for (k in c(1:100)){
Y <- rho * W %*% Y + M %*% beta + matrix(rnorm(n),n,1)
moran[k,] <- nacf(W,Y[,1],type="moran")[2:4]}
par(mfrow=c(1,1))
plot(moran[,1],type='l')
lines(moran[,2],col='red')
lines(moran[,3],col='blue')
abline( h = -1/(n-1), col='grey')
We might see a slight trend upwards and correlations are generally speaking positive.
As before, \(Y\) appears on both sides of the equation, but we can solve for \(Y\)
\[ Y = (I- \rho W)^{-1}(M\beta+\epsilon) \]
The same function as before, lnam, is used to fit this model. The difference from the network autocorrelation model is that the weight matrix is specified as W1, not W2.
neteff.2 <-lnam(y=y, x=M, W1=W)
summary(neteff.2)
##
## Call:
## lnam(y = y, x = M, W1 = W)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04530 -0.70707 0.08996 0.74132 2.41599
##
## Coefficients:
## Estimate Std. Error Z value Pr(>|z|)
## cons 1.78401 0.42224 4.225 2.39e-05 ***
## degree 0.07223 0.05714 1.264 0.206154
## smoke 1.14399 0.31259 3.660 0.000253 ***
## sport 0.19588 0.31403 0.624 0.532785
## rho1.1 0.29178 0.17255 1.691 0.090838 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimate Std. Error
## Sigma 0.9404 0.009
##
## Goodness-of-Fit:
## Residual standard error: 1.025 on 45 degrees of freedom (w/o Sigma)
## Multiple R-Squared: 0.3229, Adjusted R-Squared: 0.2477
## Model log likelihood: -68.05 on 44 degrees of freedom (w/Sigma)
## AIC: 148.1 BIC: 159.6
##
## Null model: meanstd
## Null log likelihood: -79.54 on 48 degrees of freedom
## AIC: 163.1 BIC: 166.9
## AIC difference (model versus null): 14.98
## Heuristic Log Bayes Factor (model versus null): 7.336
There is no evidence of social influence on alcohol use.
Fit a regression that includes only the intercept and smoke, and that uses the original, non-normalised adjacency matrix
neteff.2 <-lnam(y=y, x=M[,c(1,3)], W1=X)
summary(neteff.2)
##
## Call:
## lnam(y = y, x = M[, c(1, 3)], W1 = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91250 -0.60102 -0.06224 0.75113 2.24134
##
## Coefficients:
## Estimate Std. Error Z value Pr(>|z|)
## cons 2.21840 0.25578 8.673 < 2e-16 ***
## smoke 1.10413 0.30849 3.579 0.000345 ***
## rho1.1 0.06896 0.03219 2.142 0.032153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimate Std. Error
## Sigma 0.9503 0.009
##
## Goodness-of-Fit:
## Residual standard error: 1.013 on 47 degrees of freedom (w/o Sigma)
## Multiple R-Squared: 0.3161, Adjusted R-Squared: 0.2724
## Model log likelihood: -68.59 on 46 degrees of freedom (w/Sigma)
## AIC: 145.2 BIC: 152.8
##
## Null model: meanstd
## Null log likelihood: -79.54 on 48 degrees of freedom
## AIC: 163.1 BIC: 166.9
## AIC difference (model versus null): 17.9
## Heuristic Log Bayes Factor (model versus null): 14.07
What do you conclude from the results? Can you relate this to total and average exposure in the diffusion model?
When the matrix is not row-normalised there is evidence of social influence. This suggests that it is the number of friends that exerts influence (total exposure), not the proportion of friends (average exposure).